A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors
Abstract
LU decomposition is a widely used method for solving dense linear algebra problems in many scientific computing applications. In recent years, single instruction, multiple data (SIMD) technology has become a popular way to accelerate LU decomposition. However, pipeline parallelism and memory bandwidth utilization are low when LU decomposition is mapped onto SIMD processors. This paper proposes a fine-grained pipelined implementation of LU decomposition on SIMD processors. The fine-grained algorithm exploits the data dependences of the native algorithm to expose fine-grained parallelism across all the computation resources. By transforming non-coalesced memory accesses into coalesced ones, the proposed algorithm achieves high pipeline parallelism and highly efficient memory access. Experimental results show that the proposed technique achieves a speedup of 1.04x to 1.82x over the native algorithm and reaches about 89% of the peak performance of the SIMD processor.
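For readers unfamiliar with the baseline, the following is a minimal sketch of plain LU decomposition (Doolittle form, no pivoting) in Python. It is not the paper's fine-grained pipelined algorithm; the function names `lu_decompose` and `matmul` are hypothetical and chosen only for illustration. The inner loop walks along a row, which in a row-major layout is the contiguous access pattern that coalescing aims for, and the dependence of step `k + 1` on the row updates of step `k` is the kind of data dependence the abstract refers to.

```python
def lu_decompose(a):
    """Factor a square matrix a into (L, U) with L unit lower triangular.

    Illustrative sketch only: Doolittle elimination without pivoting,
    so it fails on a zero pivot and is not numerically robust.
    """
    n = len(a)
    u = [list(map(float, row)) for row in a]  # working copy, becomes U
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    for k in range(n):
        pivot = u[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no row exchange")
        row_k = u[k]
        # Each row i below the pivot is updated independently of the others,
        # which is the data parallelism a SIMD unit can exploit.
        for i in range(k + 1, n):
            row_i = u[i]
            factor = row_i[k] / pivot
            l[i][k] = factor
            # Row-wise (contiguous) update of the trailing entries.
            for j in range(k, n):
                row_i[j] -= factor * row_k[j]
    return l, u


def matmul(x, y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]
```

As a usage check, factoring `[[4, 3], [6, 3]]` with `lu_decompose` and multiplying the two factors with `matmul` should reproduce the input matrix up to rounding.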
Similar Resources
A Multiprocessor Architecture Combining Fine-Grained and Coarse-Grained Parallelism Strategies
A wide variety of computer architectures have been proposed that attempt to exploit parallelism at different granularities. For example, pipelined processors and multiple instruction issue processors exploit the fine-grained parallelism available at the machine instruction level, while shared memory multiprocessors exploit the coarse-grained parallelism available at the loop level. Using a regi...
Efficient Exploitation of Parallelism on Pentium III and Pentium 4 Processor-Based Systems
Systems based on the Pentium III and Pentium 4 processors enable the exploitation of parallelism at a fine- and medium-grained level. Dual- and quad-processor systems, for example, enable the exploitation of medium-grained parallelism by using multithreaded code that takes advantage of multiple control and arithmetic logic units. Streaming Single-Instruction-Multiple-Data (SIMD) extensions, on the o...
A Domain-Specific Architecture for Elementary Function Evaluation
We propose a Domain-Specific Architecture for elementary function computation to improve throughput while reducing power consumption as a model for more general applications: support fine-grained parallelism by eliminating branches, eliminate the duplication required by co-processors by decomposing computation into instructions which fit existing pipelined execution models and standard register...
Implementing Linear Algebra Routines on Multi-core Processors with Pipelining and a Look Ahead
Linear algebra algorithms commonly encapsulate parallelism in Basic Linear Algebra Subroutines (BLAS). This solution relies on the fork-join model of parallel execution, which may result in suboptimal performance on current and future generations of multi-core processors. To overcome the shortcomings of this approach a pipelined model of parallel execution is presented, and the idea of look ahe...
Fast parallel solver for the levelset equations on unstructured meshes
The levelset method is a numerical technique that tracks the evolution of curves and surfaces governed by a nonlinear partial differential equation (levelset equation). It has applications within various research areas such as physics, chemistry, fluid mechanics, computer vision, and microchip fabrication. Applying the levelset method entails solving a set of nonlinear partial differential equa...